We Claim:

1. A system for fault-tolerant processing, comprising:
a processor unit; 

computer instructions executable by the processor unit and operable to: 

detect at least one of: failure of other processor units in the system, and 
connectivity failures that disrupt communications between the 
processor units; 

evaluate connectivity condition scores (CCSs) for the processor units, 

wherein the processor units are operable to communicate with each 
other via at least two communication paths, and the CCSs indicate 
connectivity errors experienced on each of the communication 
paths; 

determine at least two candidate groups with the same number of at least a
portion of the processor units to include in the system; and
select between the at least two candidate groups based on the CCSs. 

2. The system of Claim 1, wherein the processor units in each candidate group are
capable of communicating with the other processor units in the candidate group. 

3. The system of Claim 1, wherein the severity of each connectivity error is factored
into the corresponding CCS. 

4. The system of Claim 1, wherein at least one of the CCSs is based on the history of
connectivity errors on the corresponding communication path. 

5. The system of Claim 1, further comprising:

computer instructions executable by the processor unit and operable to: 
unpack a bit mask of normalized CCSs from each processor unit. 
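
The bit mask of normalized CCSs can be pictured as a fixed number of bits per score, packed into one integer. The following is a minimal sketch under stated assumptions, not the claimed implementation: the 2-bit field width, the names `pack_ccs` and `unpack_ccs`, and the treatment of normalized CCSs as small non-negative integers are all hypothetical and are not specified by the claims.

```python
BITS_PER_SCORE = 2  # assumed field width; the claims do not specify one
MAX_SCORE = (1 << BITS_PER_SCORE) - 1


def pack_ccs(normalized_scores):
    """Pack a list of normalized CCSs (ints in 0..MAX_SCORE) into one integer."""
    mask = 0
    for index, score in enumerate(normalized_scores):
        if not 0 <= score <= MAX_SCORE:
            raise ValueError(f"score {score} does not fit in {BITS_PER_SCORE} bits")
        mask |= score << (index * BITS_PER_SCORE)
    return mask


def unpack_ccs(mask, count):
    """Recover `count` normalized CCSs from a bit mask built by pack_ccs."""
    return [(mask >> (index * BITS_PER_SCORE)) & MAX_SCORE for index in range(count)]
```

A receiving processor unit would call `unpack_ccs` once per sender, with `count` equal to the number of scores the sender packed.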

6. The system of Claim 1, further comprising:

computer instructions executable by the processor unit and operable to: 

form a bi-directional CCS for each processor unit based on normalized 
CCSs; and 

select between the two candidate groups to include in the system based on 
the bi-directional CCSs for the processor units in each candidate 
group. 
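
One plausible reading of a bi-directional CCS is a per-unit score that combines what the unit reports about its peers with what its peers report about it. The sketch below assumes exactly that combination rule, and it assumes a single normalized score per ordered pair of units; neither assumption comes from the claims, and the function name `bidirectional_ccs` is hypothetical.

```python
def bidirectional_ccs(normalized):
    """Return one bi-directional score per processor unit.

    normalized[i][j] is unit i's normalized CCS for its path to unit j.
    The score for unit i is the sum of what i reports about its peers
    and what its peers report about i (an assumed combination rule).
    """
    n = len(normalized)
    scores = []
    for i in range(n):
        outbound = sum(normalized[i][j] for j in range(n) if j != i)
        inbound = sum(normalized[j][i] for j in range(n) if j != i)
        scores.append(outbound + inbound)
    return scores
```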

7. A system for fault-tolerant processing, comprising:

a processor unit configurable to communicate with other components in the
system via at least two switching fabrics; and

computer instructions executable by the processor unit and operable to:

maintain a connectivity condition score (CCS) for each communication 
path along the at least two fabrics based on connectivity errors 
experienced on the path, wherein the CCSs are utilized to 
determine whether the processor unit will continue to be included 
in the system. 

8. The system of Claim 7, wherein the severity of each connectivity error is factored 
into the corresponding CCS. 

9. The system of Claim 7, wherein the number of connectivity errors during previous
observation time periods is factored into the corresponding CCS during an observation
time period.
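
Severity weighting and history can be combined in a single running score per path. The sketch below is illustrative only: it assumes an exponential decay of earlier observation periods, a default decay factor of 0.5, and a numeric severity weight per error. The class name `PathScore` and all of its parameters are hypothetical, and the claims do not prescribe this formula.

```python
class PathScore:
    """Connectivity condition score for one communication path."""

    def __init__(self, decay=0.5):
        self.decay = decay      # assumed weight given to earlier periods
        self.score = 0.0        # CCS carried across observation periods
        self._current = 0.0     # weighted errors seen in the current period

    def record_error(self, severity=1.0):
        """Add one connectivity error, weighted by its severity."""
        self._current += severity

    def end_period(self):
        """Close the observation period and fold it into the running CCS."""
        self.score = self.decay * self.score + self._current
        self._current = 0.0
        return self.score
```

Under these assumptions a path that stops producing errors sees its score halve at the end of each observation period, so old errors fade rather than vanish at once.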

10. The system of Claim 7, wherein the processor unit is further configured to 
communicate the CCSs to at least one of the other components in the system. 

11. The system of Claim 7, further comprising:

computer instructions executable by the processor unit and operable to: 
summarize each set of CCSs into a single score. 

12. The system of Claim 11, further comprising:

computer instructions executable by the processor unit and operable to:
normalize each set of CCSs based on the single score.

13. The system of Claim 7, further comprising:

computer instructions executable by the processor unit and operable to:
transform the normalized CCSs into a condensed format.

14. A computer product, comprising:
data structures including:

a connectivity condition score (CCS) for each communication path
associated with a processor unit in a distributed processing system,
wherein the CCS indicates the connectivity condition of the
communication path during at least one observation period; and
a connectivity matrix indicating whether the processor unit is able to
communicate with other components in the system through any of
the communication paths.

15. The computer product of Claim 14, further comprising:

a single score representing the sum of the CCSs for the processor unit.

16. The computer product of Claim 14, wherein each CCS is normalized and stored in
a bit mask.
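
The data structures recited in the computer product claims above could be laid out as follows. This is a sketch under assumptions, not the claimed layout: the class name `UnitConnectivity`, the use of `(peer_id, path_name)` keys, and the representation of one row of the connectivity matrix as a dictionary are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UnitConnectivity:
    """Connectivity data held for one processor unit (hypothetical layout)."""

    unit_id: int
    # path_ccs[(peer_id, path_name)] -> CCS for that communication path
    path_ccs: dict = field(default_factory=dict)
    # reachable[peer_id] -> True if any path to that peer works;
    # this is the unit's row of the connectivity matrix
    reachable: dict = field(default_factory=dict)

    @property
    def single_score(self):
        """Sum of the CCSs over all paths of this unit."""
        return sum(self.path_ccs.values())
```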

17. A method for regrouping processor units in a fault-tolerant system, comprising:

determining the ability of each processor unit to communicate with other
processor units in the system;
forming at least two candidate groups with the same number of processor units
that are able to communicate with each other; and
evaluating connectivity condition scores (CCSs) for each candidate group of the
processor units, wherein each CCS indicates the connectivity condition of
one communication path associated with the corresponding processor unit.
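
The forming step can be sketched as a search for the largest sets of units in which every pair can communicate. The code below is a brute-force illustration and assumes a boolean connectivity matrix as input; the function name `candidate_groups` is hypothetical, and the claims do not state how the groups are actually enumerated.

```python
from itertools import combinations


def candidate_groups(connected):
    """Return the largest equally sized groups of mutually connected units.

    connected[i][j] is True when unit i can communicate with unit j.
    Brute force over subsets, so suitable only for small systems.
    """
    units = range(len(connected))
    for size in range(len(connected), 0, -1):
        groups = [
            set(group)
            for group in combinations(units, size)
            if all(connected[a][b] and connected[b][a]
                   for a, b in combinations(group, 2))
        ]
        if groups:
            return groups
    return []
```

For three units where only the link between units 1 and 2 has failed, the function returns the two candidate groups `{0, 1}` and `{0, 2}`, which have the same number of units and must then be distinguished by their CCSs.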

18. The method of Claim 17, wherein the CCS is based on the number of connectivity
errors experienced by the corresponding communication path. 

19. The method of Claim 17, wherein at least one of the CCSs is based on the history 
of connectivity errors experienced by the corresponding communication path. 

20. The method of Claim 18, wherein the severity of each connectivity error is
factored into the corresponding CCS. 

21. The method of Claim 18, further comprising:

forming a bi-directional CCS for each processor unit; and 

selecting between the at least two candidate groups to include in the system based 
on the sum of the bi-directional CCSs for the processor units in each 
group. 

22. The method of Claim 21, further comprising:

selecting an arbitrary one of the at least two candidate groups when the candidate 
groups have the same sum of bi-directional CCSs. 
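
Selection by summed bi-directional CCS, with an arbitrary choice on ties, might look like the sketch below. It assumes that a higher CCS means more connectivity errors, so the group with the lowest sum is preferred; the claims do not state the direction of the score. The function name `select_group` is hypothetical, and taking the first group on a tie is just one arbitrary but repeatable choice.

```python
def select_group(groups, bidirectional):
    """Pick the candidate group with the lowest summed bi-directional CCS.

    groups is a list of sets of unit indices; bidirectional[u] is the
    bi-directional CCS of unit u. min() returns the first minimal
    element, which settles ties in favor of the earliest group.
    """
    return min(groups, key=lambda group: sum(bidirectional[u] for u in group))
```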

23. An apparatus for regrouping processor units in a fault-tolerant system,
comprising:

means for forming at least two candidate groups of processor units that are able to
communicate with each other;
means for evaluating connectivity condition scores (CCSs) for each candidate 

group of the processor units, wherein each CCS indicates the severity of 

connectivity errors experienced by one communication path associated 

with the corresponding processor unit; and 
means for selecting one of the at least two candidate groups based on the CCSs. 

24. The apparatus of Claim 23, further comprising means for counting the number of 
connectivity errors experienced by the corresponding communication path during an 
observation period. 

25. The apparatus of Claim 23, further comprising means for factoring into the CCS
connectivity errors experienced by the corresponding communication path during at least 
one previous observation period. 

26. The apparatus of Claim 23, further comprising means for selecting a candidate 
group based on the survival priority of the processor units included in each candidate 
group. 

27. The apparatus of Claim 26, further comprising means for selecting a candidate 
group based on the CCSs, when both candidate groups have the highest number of 
processor units and/or processor units with the highest survival priority. 
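
The ordering of criteria suggested by Claims 26 and 27 (group size, then survival priority, then CCSs) can be sketched as a lexicographic comparison. Several details here are assumptions that do not come from the claims: a larger number means a higher survival priority, a group's priority is that of its highest-priority member, and a lower CCS total is better. The function name `select_group_with_priority` is hypothetical.

```python
def select_group_with_priority(groups, priority, ccs):
    """Choose a group by size, then survival priority, then CCS total.

    groups is a list of sets of unit indices; priority[u] and ccs[u]
    are the survival priority and CCS of unit u.
    """
    def key(group):
        return (
            -len(group),                       # more units first
            -max(priority[u] for u in group),  # then highest survival priority
            sum(ccs[u] for u in group),        # then lowest CCS total
        )
    return min(groups, key=key)
```

The CCS total only decides the outcome when the groups tie on both of the earlier criteria.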